정보과학회논문지 (Journal of KIISE)
한글제목 (Korean Title) |
단어 생성 이력을 이용한 요약문 생성의 어휘 반복 문제 해결 |
영문제목 (English Title) |
Solving for Redundant Repetition Problem of Generating Summarization using Decoding History |
저자 (Author) |
류재현
노윤석
최수정
박세영
박성배
Jaehyun Ryu
Yunseok Noh
Su Jeong Choi
Seyoung Park
Seong-Bae Park
|
원문수록처 (Citation) |
Vol. 46, No. 6, pp. 535-543 (June 2019) |
한글내용 (Korean Abstract) |
One of the most frequent problems in sequence-to-sequence summarization models is that words, phrases, or sentences are generated repetitively and unnecessarily while producing a summary. To address this, most previous studies have proposed adding extra modules to the model; however, such approaches provide insufficient training signal about which words should not be generated, and are therefore limited in resolving the repetition problem. This paper proposes a new training method based on Repeat Loss, which controls repetitive generation by directly using the word generation history. By defining Repeat Loss as the probability, under the decoder's output distribution, that a previously generated word is generated again, the probability of repeating an already-emitted word can be controlled directly. Training a summarization model with the proposed method, we experimentally confirmed that word repetition decreased and high-quality summaries were produced.
|
영문내용 (English Abstract) |
Neural attentional sequence-to-sequence models have achieved great success in abstractive summarization. However, such models are limited by several challenges, including the repetitive generation of words, phrases, and sentences in the decoding step. Many studies have attempted to address the problem by modifying the model structure. Although considering the actual history of word generation is crucial to reducing word repetition, these methods do not consider the decoding history of the generated sequence. In this paper, we propose a new loss function, called 'Repeat Loss', to avoid repetitions. The Repeat Loss directly prevents the model from repetitively generating words by imposing a loss penalty on the generation probability of words already generated in the decoding history. Since the proposed Repeat Loss does not require a special network structure, the loss function is applicable to any existing sequence-to-sequence model. In experiments, we applied the Repeat Loss to a number of sequence-to-sequence summarization systems and trained them on both Korean and CNN/Daily Mail summarization datasets. The results demonstrate that the proposed method reduced repetitions and produced high-quality summaries.
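Both abstracts describe the same core idea: at each decoding step, penalize the probability mass the decoder assigns to tokens it has already emitted. A minimal sketch of that computation, assuming per-step softmax distributions are available (names and shapes are illustrative, not the authors' code):

```python
# Illustrative sketch (not the authors' implementation) of a repeat-style loss:
# at each decoding step t, sum the probability the decoder assigns to every
# token emitted before step t, then average over all steps.
import numpy as np

def repeat_loss(step_probs: np.ndarray, generated_ids: list) -> float:
    """step_probs: (T, V) array of per-step softmax distributions.
    generated_ids: the T token ids actually emitted by the decoder.
    Returns the mean probability of re-generating an earlier token."""
    total = 0.0
    history = set()                        # token ids emitted before step t
    for t, probs in enumerate(step_probs):
        total += float(sum(probs[i] for i in history))
        history.add(generated_ids[t])
    return total / len(step_probs)

# Toy vocabulary of 3 tokens over 3 decoding steps.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1]])
print(repeat_loss(probs, [0, 1, 1]))      # (0 + 0.6 + 0.9) / 3 ≈ 0.5
```

In training, a term like this would presumably be added, with a weighting coefficient, to the usual cross-entropy loss; since it needs only the decoder's output distributions and the emitted token ids, it attaches to any sequence-to-sequence model, consistent with the abstract's claim that no special network structure is required.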
|
키워드 (Keyword) |
document summarization
repetition control
sequence-to-sequence
loss function
text summarization
sequence-to-sequence model
word repetition
repeat loss
|